CTexploreRlibrary(CTexploreR)
library(tidyverse)
library(SummarizedExperiment)
library(ggrepel)
CAVEATS:
Problematic gene names, no precise chromosomal location, not an importable format…
Expression profiles based on old methods (no RNAseq data)
No data about DNA methylation
Recently identified CT genes not present in the database
CT Selection procedure:
CAVEATS:
Just a list of gene in supplemental data
Selection criteria not stringent (many “CT genes” are not really testis-specific)
Based on RPKM values (GTEX v6 and TCGA), not TPM
Multi-mapping problem not well addressed
No individual data about promoter DNA methylation
DT::datatable(CT_genes)
Testis-specific genes (expressed exclusively in testis) and testis-preferential genes (expressed in a few somatic tissues at a level lower than 10x testis expression) were first selected using GTEx database.
Note that some genes were undetectable in GTEx database due to
multimapping issues (these were flagged as “lowly expressed” in
GTEX_category column). A careful inspection of these genes
showed that many of them are well-known Cancer-Germline genes, belonging
to gene families (MAGEA, SSX, CT45A, GAGE, …) from which members have
identical or nearly identical sequences. This is likely the reason why
these genes are not detected in GTEx database, as GTEX processing
pipeline specifies that overlapping intervals between genes are excluded
from all genes for counting.
For these genes, as testis-specificy could not be assessed using GTEx
database, RNAseq data from a set of normal tissues were home-processed
in order to allow multimapping. Expression of several genes became
detectable in some tissues. Genes showing a testis-specific expression
(expression at least 10x higher in testis than in any somatic tissues
when multimapping was allowed) were selected, and flagged as
testis-specific in multimapping_analysis column.
RNAseq data from cancer cell lines from CCLE were used to select
among testis-specific and testis-prefential genes those that are
activated in cancers. In the CCLE_category column, genes
are tagged as “activated” when they are highly expressed in at least one
cancer cell line (TPM >= 10). But genes that were found to be
expressed in all -or almost all- cancer cell lines were removed, as this
probably reflects a constitutive expression rather than a true
activation. We filtered out genes that were not completely repressed in
at least 20 % of cell lines (TPM <= 0.1).
RNAseq data from TCGA cancer samples were used to select among
testis-specific and testis-prefential genes those that are activated in
cancers. In the TCGA_category column, genes are tagged as
“activated” when they are highly expressed in at least one cancer sample
(TPM >= 10). But genes that were found to be expressed in all -or
almost all- cancer cell lines were removed, as this probably reflects a
constitutive expression rather than a true activation. We filtered out
genes that were not completely repressed in at least 20 % of cancer
samples (TPM <= 0.1).
All selected CT genes were visualised on IGV using a RNA-seq
alignment from testis, to ensure that expression in testis really
corrresponded to the canonical transcript. For some genes for which the
canonical transcript did not correspond to the transcript that we could
see in the testis sample, we modified manually the
external_transcript_name accordingly, to ensure that the
TSS and the promoter region are correctly defined. This is particularly
important for methylation analyses that must be focused on true promoter
regions.
Example of gene were the canonical transcript was changed manually, according to transcript structure observed in testis:
Few genes were also removed at that step, as reads were not aligned properly to any referenced transcript (example below).
Genes flagged as TRUE in
regulated_by_methylation column correspond to
Genes that are significantly induced by a demethylating agent (RNAseq analysis of cell lines reated with DAC (5-Aza-2′-Deoxycytidine)).
Genes that have a highly methylated promoter in normal somatic tissues but less methylated in germ cells (WGBS analysis of a set of normal tissues).
For some genes showing a strong activation in cells treated with 5-Aza-2′-Deoxycytidine, methylation analysis was not possible due to multimapping issues. In this case, genes were still considered as regulated by methylation unless their promoter appeared unmethylated in somatic tissues or methylated in germ cells.
controled_by_methylation <- CT_genes %>%
filter(regulated_by_methylation == TRUE)
not_controled_by_methylation <- CT_genes %>%
filter(regulated_by_methylation == FALSE)
160 genes appear to be controled by methylation
147 genes appear to be controled by methylation
GTEX_expression()Allows to visualise gene expression in GTEx tissues.
testis_specific <- CT_genes %>%
filter(testis_specificity == "testis_specific")
GTEX_expression(testis_specific$external_gene_name, units = "log_TPM")
testis_preferential <- CT_genes %>%
filter(testis_specificity == "testis_preferential")
GTEX_expression(testis_preferential$external_gene_name, units = "log_TPM")
normal_tissue_multimapping_heatmap()Allows to visualise the expression values obtained by counting or not multi-mapped reads in normal tissues.
testis_specific_in_multimapping_analysis <- CT_genes %>%
filter(lowly_expressed_in_GTEX)
multimapping option set to FALSE
normal_tissue_expression_multimapping(testis_specific_in_multimapping_analysis$external_gene_name,
multimapping = FALSE, units = "log_TPM")
multimapping option set to TRUE
normal_tissue_expression_multimapping(testis_specific_in_multimapping_analysis$external_gene_name,
multimapping = TRUE, units = "log_TPM")
CCLE_expression()Allows to visualise gene expression in different histological types of CCLE cancer cell lines.
CCLE_expression(genes = CT_genes$external_gene_name,
type = c("Skin", "Lung", "Esophageal", "Head_and_Neck", "Leukemia"),
units = "log_TPM")
CCLE_expression(genes = testis_preferential$external_gene_name,
type = c("Skin", "Lung", "Esophageal", "Head_and_Neck", "Leukemia"),
units = "log_TPM")
One can select genes base on activation frequencies in CCLE cell
lines using the percent_of_positive_CCLE_cell_lines column
of CT_genes.
frequently_activated <- CT_genes %>%
filter(percent_of_positive_CCLE_cell_lines >= 5)
CCLE_expression(genes = frequently_activated$external_gene_name,
type = c("Skin", "Lung", "Esophageal", "Head_and_Neck", "Leukemia"),
units = "log_TPM")
TCGA_expression()Allows to visualise gene expression in cancer samples from TCGA (in ‘SKCM’, ‘LUAD’, ‘LUSC’, ‘COAD’, ‘ESCA’, ‘BRCA’, or ‘HNSC’ tumor types).
TCGA_expression(genes = frequently_activated$external_gene_name,
tumor = "all",
units = "log_TPM")
TCGA_expression(genes = frequently_activated$external_gene_name,
tumor = "BRCA",
units = "log_TPM")
TCGA_expression(genes = frequently_activated$external_gene_name,
tumor = "LUAD",
units = "log_TPM")
DAC_induction()Allows to visualise gene induction upon DAC treatment in a series of cell lines.
DAC_induction(genes = controled_by_methylation$external_gene_name)
DAC_induction(genes = not_controled_by_methylation$external_gene_name)
normal_tissues_mean_methylation()Gives the mean methylation level of CpGs located in each promoter region (defined as 1000 nt upstream TSS and 200 nt downstream TSS) in a set of normal tissues.
normal_tissues_mean_methylation(genes = controled_by_methylation$external_gene_name)
normal_tissues_mean_methylation(genes = not_controled_by_methylation$external_gene_name)
Note that some genes are demethylated in testis but not (or extremely weakly!) induced by 5-aza.
=> Are considered as not regulated by methylation.
genes <- not_controled_by_methylation %>%
filter(somatic_methylation == TRUE, germline_methylation == FALSE) %>%
pull(external_gene_name)
normal_tissues_mean_methylation(genes = genes)
DAC_induction(gene = genes)
! Some genes are already expressed in all control cell lines!
=> not a good model to test DAC induction for these genes…
NB: Tested their expression in HCT166 DKO, none are highly induced. (NB: few MAGE genes were added as comparison points)
OPEN QUESTION: should we include HCT116 DKO data?
load("/data/cbio/CBIOProjects/202210_CBIO_CTexploreR/datasets/HCT116DKO/RNAseq/processed_data/count_matrix/dds_MP.rda")
ensembl_ids <- CT_genes %>%
filter(external_gene_name %in% genes) %>%
pull(ensembl_gene_id)
MAGE_ids <- CT_genes %>%
filter(external_gene_name %in% c("MAGEA6", "MAGEA1", "MAGEA4", "MAGEC1")) %>%
pull(ensembl_gene_id)
as_tibble(round(counts(dds[c(ensembl_ids, MAGE_ids),], normalize = TRUE), 2), rownames = "ensembl_gene_id") %>%
left_join(CT_genes %>%
select(external_gene_name, ensembl_gene_id)) %>%
select(external_gene_name, everything()) %>%
select(-ensembl_gene_id) %>%
arrange(desc(SRR12849891)) %>%
DT::datatable()
## Joining, by = "ensembl_gene_id"
# ensembl_ids <- CT_genes %>% filter(external_gene_name %in% c("LIN28A", "LIN28B", "CT83")) %>% pull(ensembl_gene_id)
# MAGE_ids <- CT_genes %>% filter(external_gene_name %in% c("MAGEA6", "MAGEA1", "MAGEA4", "MAGEC1")) %>% pull(ensembl_gene_id)
# as_tibble(round(counts(dds[c(ensembl_ids, MAGE_ids),], normalize = TRUE), 2), rownames = "ensembl_gene_id") %>%
# left_join(CT_genes %>% select(external_gene_name, ensembl_gene_id)) %>%
# select(external_gene_name, everything()) %>%
# select(-ensembl_gene_id) %>%
# arrange(desc(SRR12849891)) %>%
# DT::datatable()
normal_tissues_methylation()Can be used to analyse more precisely the methylation of a particular promoter in normal tissues as methylation values are given CpG by CpG.
normal_tissues_methylation("MAGEB2")
normal_tissues_methylation("LIN28B")
normal_tissues_methylation("LIN28A")
TCGA_methylation_expression_correlation()Shows the correlation between gene expression and promoter methylation in TCGA samples.
TCGA_methylation_expression_correlation(tumor = "LUAD",
gene = "TDRD1")
TCGA_methylation_expression_correlation(tumor = "all",
gene = "TDRD1")
TCGA_methylation_expression_correlation(tumor = "all",
gene = "LIN28A")
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggrepel_0.9.2 SummarizedExperiment_1.26.1
## [3] Biobase_2.56.0 GenomicRanges_1.48.0
## [5] GenomeInfoDb_1.32.4 IRanges_2.30.1
## [7] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [9] MatrixGenerics_1.8.1 matrixStats_0.62.0
## [11] forcats_0.5.2 stringr_1.4.1
## [13] dplyr_1.0.10 purrr_1.0.1
## [15] readr_2.1.3 tidyr_1.2.1
## [17] tibble_3.1.8 ggplot2_3.4.0
## [19] tidyverse_1.3.2 CTexploreR_0.1.0
##
## loaded via a namespace (and not attached):
## [1] googledrive_2.0.0 colorspace_2.0-3 rjson_0.2.21
## [4] ellipsis_0.3.2 circlize_0.4.15 XVector_0.36.0
## [7] GlobalOptions_0.1.2 fs_1.5.2 clue_0.3-62
## [10] rstudioapi_0.14 farver_2.1.1 bit64_4.0.5
## [13] DT_0.26 AnnotationDbi_1.58.0 fansi_1.0.3
## [16] lubridate_1.8.0 xml2_1.3.3 splines_4.2.0
## [19] codetools_0.2-18 doParallel_1.0.17 cachem_1.0.6
## [22] geneplotter_1.74.0 knitr_1.40 jsonlite_1.8.3
## [25] broom_1.0.1 annotate_1.74.0 cluster_2.1.4
## [28] dbplyr_2.2.1 png_0.1-7 compiler_4.2.0
## [31] httr_1.4.4 backports_1.4.1 assertthat_0.2.1
## [34] Matrix_1.5-1 fastmap_1.1.0 gargle_1.2.1
## [37] cli_3.6.0 htmltools_0.5.3 tools_4.2.0
## [40] gtable_0.3.1 glue_1.6.2 GenomeInfoDbData_1.2.8
## [43] Rcpp_1.0.9 cellranger_1.1.0 jquerylib_0.1.4
## [46] Biostrings_2.64.1 vctrs_0.5.1 iterators_1.0.14
## [49] crosstalk_1.2.0 xfun_0.34 rvest_1.0.3
## [52] lifecycle_1.0.3 XML_3.99-0.12 googlesheets4_1.0.1
## [55] zlibbioc_1.42.0 scales_1.2.1 hms_1.1.2
## [58] parallel_4.2.0 RColorBrewer_1.1-3 ComplexHeatmap_2.12.1
## [61] yaml_2.3.6 memoise_2.0.1 sass_0.4.2
## [64] RSQLite_2.2.18 stringi_1.7.8 highr_0.9
## [67] genefilter_1.78.0 foreach_1.5.2 BiocParallel_1.30.4
## [70] shape_1.4.6 rlang_1.0.6 pkgconfig_2.0.3
## [73] bitops_1.0-7 evaluate_0.17 lattice_0.20-45
## [76] htmlwidgets_1.5.4 labeling_0.4.2 bit_4.0.4
## [79] tidyselect_1.2.0 magrittr_2.0.3 DESeq2_1.36.0
## [82] R6_2.5.1 generics_0.1.3 DelayedArray_0.22.0
## [85] DBI_1.1.3 pillar_1.8.1 haven_2.5.1
## [88] withr_2.5.0 survival_3.4-0 KEGGREST_1.36.3
## [91] RCurl_1.98-1.9 modelr_0.1.9 crayon_1.5.2
## [94] utf8_1.2.2 tzdb_0.3.0 rmarkdown_2.17
## [97] GetoptLong_1.0.5 locfit_1.5-9.6 grid_4.2.0
## [100] readxl_1.4.1 blob_1.2.3 reprex_2.0.2
## [103] digest_0.6.30 xtable_1.8-4 munsell_0.5.0
## [106] bslib_0.4.0